K-Nearest-Neighbors Induced Topological PCA for Single Cell RNA-Sequence Data Analysis

Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse in dimensionality reduction, lacks the ability to capture geometrical structure information embedded in the data, and previous graph Laplacian regularizations are limited by the analysis of only a single scale. We propose a topological Principal Components Analysis (tPCA) method by the combination of persistent Laplacian (PL) technique and L2,1 norm regularization to address multiscale and multiclass heterogeneity issues in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique which addresses the many limitations of the traditional persistent homology. Rather than inducing filtration via the varying of a distance threshold, we introduced kNN-tPCA, where filtrations are achieved by varying the number of neighbors in a kNN network at each step, and find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of our proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets, and showcase that our methods outperform other unsupervised PCA enhancements from the literature, as well as popular Uniform Manifold Approximation (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF) by significant margins. For example, tPCA provides up to 628%, 78%, and 149% improvements to UMAP, tSNE, and NMF, respectively on classification in the F1 metric, and kNN-tPCA offers 53%, 63%, and 32% improvements to UMAP, tSNE, and NMF, respectively on clustering in the ARI metric.


Introduction
Single cell RNA sequencing (scRNA-seq) is a relatively new method that profiles transcriptomes of individual cells, revealing vast information in the heterogeneity within cell population, which has lead to further understanding of gene expression, gene regulation, cell-cell communication, cell differentiation, spatial transcrtiptomics, signal transduction pathways, and more [1,2].The workflow of a typical scRNA-seq analysis involves single cell isolation, RNA extraction and sequencing using a library and downstream analysis.With the technological improvements, more than 20,000 genes can be profiled, which has led to a high-dimensionality challenge.Despite the improvements in the methodology that allows for more accurate reading of genes and increasing the number of sequenced cells per experiment, analyzing the data for downstream analysis remains a hurdle.Numerous methods and procedures have been proposed to analyze the data [3][4][5][6][7][8][9].Specific challenges in scRNA-seq data analysis include drop-out events-induced zero expression read count, inadequate sequencing depth-induced inconsistent low reading counts, noise data, and high dimensionality [8,10].Therefore, dimensionality reduction and feature selection to eliminate low signals is an essential step in analyzing scRNA-seq data.
Various dimensionality reduction and feature selection methods have been proposed for analyzing scRNAseq data.ScRNA by non-negative and low rank representation (SinLRR) assumes that scRNA-seq data is inherently low rank and finds the smallest ranked matrix that approximates the original data [11].Single-cell interpretation via multikernel learning (SIMLR) utilizes multiscale kernel to learn a cell-cell similarity metric that can be used for downstream analysis [12].Deep learning has also been used to perform dimensionality reduction [13][14][15][16].
Traditional dimensionality has also been widely incorporated into scRNA-seq analysis pipeline.Nonlinear dimensionality reduction, such as uniform manifold approximation and projection (UMAP), t-distributed stochastic neighbor embedding (t-SNE), multidimensional scaling (MDS) and isomap have been utilized for visualization [17][18][19][20].However, directly applying such algorithm can be challenging because these methods rely on distance calculation and data sparsity, but high dimensional scRNA-seq data may suffer from poor distance calculations.Recently, correlated clustering and projection (CCP) has been used on scRNA-seq data and its visualization [21,22].CCP utilizes gene-gene correlation to partition genes, and uses cell-cell correlation on the partitioned genes to project the original genes into super-genes.Non-negative matrix factorization (NMF) has been widely utilized due to its interpretability.NMF uses matrix factorization, where the basis can be interpreted as meta-gene, which are weighted sums of the original genes.Numerous NMF with various constrains has been proposed for scRNA-seq [23,24].One of the oldest dimensionality reduction methods, principal components analysis (PCA) is still one of the most popular method used for scRNA-seq [25].
PCA is a linear dimensionality reduction algorithm, where the goal is to compute a orthonormal basis that maximizes the variance of the projected data.The first component is call the principle component, where the variance of the projected data is maximized.The subsequent components i are orthogonal to the i − 1 components and maximizes the variance of the residual of the approximation.Though PCA is popular because of its efficiency and is linear due to the having no reliance on utilizing metric computation, it has its drawbacks.First, PCA lacks concrete interpretability because of its eigenmode-representation the data.In scRNA-seq, there are many 0-counts in the data, i.e. the gene is not expressed, which can contribute to the principle component and the coefficient matrix can be dense.Second, PCA assumes that the noise of the data is Gaussian, which may not model scRNA-seq data well because the original data is a count matrix.To tackle the first problem, sparse PCA (sPCA) was introduced [20], where by adding a l 1 penalty term on the basis, it allowed for sparsity in the principle components.An alternative formulation of PCA was introduced by Nie et.al. [26] to improve robustness of PCA by assuming the noise is sampled from the Laplace distribution, which is called rPCA.Graph regularization was further introduced by Jiang et.al. which utilizes a graph Laplacian to incorporate nonlinear manifold structure to the reduction [27].However, graph Laplacian can only capture a single scale, and is not able to capture the topological structure of the data.
We introduced persistent Laplacian-Enhanced PCA theory in previous works, which was shown to outperform all other state-of-the-art PCA enhancements for microarray data analysis [28].Persistent Laplacian, also called persistent combinatorial Laplacian or persistent spectral graph, was introduced as a new generation of topological data analysis (TDA) methods in 2019 [29].It has stimulated a variety of theoretical developments [30][31][32][33] and led to remarkable applications [34][35][36].PLPCA has the ability to recognize the stability of topological features in our data at multiple scales, and provide a more thorough spatial view through the filtration of a simplicial complex, which induces a sequence of simplicial complexes.Accounting for the spectra of each corresponding Laplacian matrix for each complex in the sequence enables us extract this topological information, improving our ability to preserve intrinsic geometrical information during dimensionality reduction.We can accomplish this by constructing a weighted sum of each Laplacian matrix, generating an accumulated spectral graph.However, PLPCA requires intensive parametrization.
The objective of the present work is to introduce L 2,1 norm regularization to improve the sparsity and heterogeneity constraint in PLPCA.The resulting technique, called topological PCA (tPCA), allows us to significantly reduce the distribution of the weightings in the accumulated spectral graph, while still achieving near optimal performance.Additionally, we introduce a k-nearest neighbor (kNN) persistent Laplacian algorithm to improve the robustness of our tPCA.The performance of resulting kNN-tPCA does not depend on parameter search.We extensively validate the performance of tPCA and kNN-tPCA for clustering and classification on a series of 11 scRNA-seq datasets.We demonstrate that our new method is superior to other PCA enhancements as well as NMF.
Our work then proceeds as follows.First, we review the mathematical formulations of each of the enhancements that build up to tPCA.Specifically, graph regularization and sparseness.We then discuss the tools for perssistent Laplacian theory which we will incorporate to arrive at our final procedure: RpLSPCA, or tPCA.We then validate the performance of this new method on a set of 11 scRNA-seq datasets against other notable PCA enhancements, as well as NMF, for clustering and classification.Extensive tests indicate that our methods are the state of the art procedure for dimensionality reduction prior to clustering and classification.Specifically, over the 11 tested datasets, tPCA outperforms UMAP on average by 628% on the F1 metric.Lastly, we visualize the tPCA Eigen-Genes via UMAP and tSNE, as well as Residue Similarity Plots to further assess the performance of our methods.

Methods
In this section, we provide an overview of PCA and its derivatives, including sparse PCA (sPCA), and graph Laplacian regularized sparse PCA (gLSPCA).We then introduce our method, topological PCA (tPCA) and a less parameter-intensive alternative kNN-tPCA.We will first define the notation used in the following subsection.

Notations
We summarize our notations as follows.
1. X = {x 1 , ..., x N } ∈ R M ×N , where x j ∈ R M indicates the jth sample or cell, and M is the number of genes.

∥A∥
A ij is the frobenius norm of matrix A.

∥A∥
where the summation is taken after taking the l 2 norm of each column.Alternatively, we can think of this as taking the l 2 -norm of the genes and taking the sum over the cells.4. Tr(A) is the trace of matrix A.

5.
Let m << M the dimension of the subspace, and M is the number of original dimension, ie the number of genes.
6. Let N be the number of samples or cells.
7. U ∈ R m×M is the principle components, or the basis of the lower dimensional subspace of X, and m is the number of dimension.
8. Q ∈ R m×N is the projection of X onto the subspace spanned by U .

PCA
Principal Component Analysis has historically served as the baseline approach for dimensionality reduction during gene expression data analysis [37].The goal of PCA is to express some high dimensional data X ∈ R M ×N in a lower dimensional space.This is accomplished via computing the principal components, which are the eigenvectors of the covariance of X corresponding to the largest eigenvalues.Alternatively, we can express PCA as finding a m-dimensional subspace that approximate the data matrix, i.e., min where Q T Q = I N is the orthonormal constrain, and I N is the N × N , or cell by cell, identity matrix.When the original matrix X is 0-mean 1-variance scaled, we can see that this formulation is equivalent to take the eigenvectors of the covariance matrix.Alternatively, the orthonormal constrain can be applied to U , which yields traditional PCA.

Sparse PCA
PCA requires that the principal components be expressed as linear combinations of all the features with non-zero weightings.However, in the context of gene expression data analysis, this introduces unnecessary computational complexity and noise, because many genes are not expressed in the cells, or is only expressed under particular circumstances [3].Therefore, the interpretability of PCA is significantly aided by the introduction of Sparse PCA (sPCA), which allows for zero weightings [20].Sparse PCA can take several forms, notably the inclusion of an L 2,1 norm penalty term in the objective function as in Eq. 2: min The L 2,1 norm is defined as ∥A∥ 2,1 = n i=1 ∥a i ∥ 2 , or first calculating the L 2 norm of each row, and then computing the L 1 norm of row-based L 2 norms.The β parameter scales the sparse regularization term.

Graph Laplacian Regularized Sparse PCA
While the inclusion of sparse regularization should address some of the shortcomings of PCA relating to interpretability, it does not address the inability of PCA in recognizing complex geometric structures which are present in the higher dimensional space.Graph Laplacian has been commonly used to incorporate geometric information into the reduction such that similar samples in high dimension will be closer in the lower dimensional embedding.This can be accomplished with neighbor graphs with pairwise edges, specifically, the Laplacian operator [38].
Let G(V, E, W ) be a nearest neighbor graph, where V is the set of vertices, E is the edge, and ω are the weights of the edges.E can be defined as E = {(x j , x i ) : x i ∈ N k (x j ) or x j ∈ N k (x i )}, where N k (x j ) is the k-nearest neighbors of sample j under some metric.For pairs of points in the edge set, we can define a weight satisfying the following two properties Such condition is satisfied by a class of functions called radial basis functions.For this work, we utilize the Gaussian kernel as the edge weights The matrix W is known as the weighted adjacency matrix.Here, η ∈ R defines the geodesic distance, or the width of the Gaussian kernel.We can then define the Graph Laplacian L, by taking where D is the degree matrix, defined as the row sum of W , which shows the total connectivity of the vertex i. Laplacian graph provides a graphical embedding, which can be used as a regularization for PCA [27].
In order to incorporate manifold regularization into the PCA framework, consider the distance ∥q i −q j ∥ 2 , where q i and q j correspond to the lower dimensional representation of samples x i and x j , respectively.Using the graph weights W ij , we see that if W ij → 1, ie x i and x j are similar, then ∥q i − q j ∥ → 0. Alternatively, if W ij → 0, ie x i and x j are dissimilar, ∥q i − q j ∥ 2 → ∞.Using this fact, we want to minimize the following.
Utilizing the sparse PCA and the manifold regularization, we obtain the graph Laplacian sparse PCA (gLSPCA) min We also observe that the loss function of gLSPCA is Frobenius norm regularization, which is sensitive to outliers and data heterogeneity when dealing with multiclass data.We then propose replacing Frobenius norm regularization with L 2,1 norm regularization to achieve robustness.This yields the following new objective function, which we call Robust graph Laplacian Sparse PCA (RgLSPCA), min

Topological PCA
While the inclusion of sparseness and graph Laplacian regularization seeks to address interpretability and geometric structure capture, it still lacks the ability to recognize the stability of topological features at multiple scales, as well as homotopic shape information [29].To this end, we turn to persistent Laplacian regularization.Like persistent homology, persistent spectral graph theory tracks the birth and death of topological features of data as they change over scales [30,39].We perform this analysis via a filtration procedure on our data to construct a family of geometric structures [29].We then can study the topological properties of each configuration by its corresponding Laplacian matrix.
First, we must briefly review the notion of a simplex, simplicial complex, q-chain, and boundary.A 0-simplex is a vertex, a 1-simplex is an edge, a 2-simplex is a triangle, and so on.Generally, we consider a q-simplex, σ q .A simplicial complex is then a means of approximating a topological space by gluing together the faces of simplices.More formally, a simplicial complex K is a collection of simplices such that: 1.If σ q ∈ K and σ p is a face of σ q then σ p ∈ K.

The nonempty intersection of any two simplices is a face of both simplices.
A q-chain is then defined as a formal sum of q-simplices in a simplicial complex K with coefficients in Z 2 .The set of q-chains has a basis in the set of q-simplices in K, and this set forms a finitely generated free Abelian group C q (K).We then define the boundary operator as a homomorphism relating the Chain groups, ∂ q : C q (K) → C q−1 (K).The boundary operator is defined as: where σ q−1 is a q − 1 simplex.The sequence of chain groups connected by this homomorphism is then a Chain Complex: ...
It is well known that the boundary operator and the Chain Complex associated with a simplicial complex gives the number of q-dimensional holes in that topological space.Specifically, the qth Homology Group is defined as H q = ker∂ q /Im∂ q .This is also known as the qth Betti Number, β q .The matrix representation of the qth boundary operator with respect to the standard basis in C q (K) and C q−1 (K) is given as B q .Besides considering the Homology of our topological space, we can also consider its cohomology.To that end, we define the adjoint operator of ∂ q as: and the transpose of B q , denoted B T q , is the matrix representation of ∂ * q with respect to the same basis.We can now define the q-combinatorial Laplacian matrix as: The harmonic spectrum of the q-combinatorial Laplacian matrix reveals the dimension of the qth Homology group, or the number of q-dimensional holes in our simplicial complex.The non-harmonic spectrum then reveals further homotopic shape information [40].Intuitively, β 0 reveals the number of connected components in K, β 1 reveals the number of loops in K, and β 2 reveals the number of 2D voids in K.
However, this framework is confined to the analysis of only a single simplicial complex, or the connectivities at only a single scale.To enrich our spectral information, Persistent spectral graph theory proposes creating a sequence of simplicial complexes by varying a filtration parameter [40]: For each subcomplex K t we can denote its chain group to be C q (K t ), and the q-boundary operator ∂ t q : C q (K t ) → C q−1 (K t ).By convention, we define C q (K t ) = {0} for q < 0 and the q-boundary operator to then be the zero map.The boundary operator and adjoint boundary operator are otherwise defined similarly as before for each K t in the sequence, which allows us to define a sequence of Chain Complexes.
Next, we introduce persistence to the Laplacian spectra.Define the subset of C t+p q whose boundary is in C t q−1 as C t,p q , assuming the natural inclusion map from On this subset, one may define the p-persistent q-boundary operator denoted by ∂t,p q : C t,p q → C t q−1 and corresponding adjoint operator ( ∂t,p q ) * : C t q−1 → C t,p q , as before.The matrix representation of the p-persistent q-boundary operator in simplicial basis is then B t,p q+1 , and the matrix representation of the adjoint operator is again the transpose.This allows us to define the q-order p-persistent Laplacian matrix as: We may again recognize the multiplicity of zero in the spectrum of L t,p q as the q'th order p-persistent Betti number β t,p q which counts the number of (independent) q-dimensional voids in K t that still exists in K t+p [29].We can then see how the q'th-order Laplacian is actually just a special case of the q'th-order 0-persistent Laplacian at a simplicial complex K t , or rather, at a snapshot of the filtration.
We can capture a more thorough view of the spatial features of our data by focusing on the 0-persistent Laplacian as we have done in previous works [28].Specifically, we calculate Vietoris-Rips complexes by varying a filtration parameter on the weighted entries of our Laplacian matrix, which correspond to the weighted edges in our graph structure.By gradually increasing a distance threshold, we induce a sequence of simplicial complexes and subgraphs to analyze.In previous works on Persistent Laplacian-enhanced PCA, we provided a convenient computational method for this.For a graph Laplcian matrix L, observe: For i ̸ = j, let l max = max(l ij ), l min = min(l ij ), d = l max −l min .Set the t th Persistent Laplacian L t , t = 1, ..., p: Here, We then weight each L t in the sequence and sum to consolidate each subgraph into a single term, denoted P L. This new term should encode the persistence of topological features as the filtration progresses over multiple scales The optimal ζ weightings are hyper parameters which are obtained via Grid Search.Ideally, we should recognize which scales of connectivity contribute the most important information to our analysis, and place greater emphasis on that corresponding Laplacian matrix in the sum.More details regarding this procedure can be found in our previous works [28].Substituting this P L term into Eq.7 gives rise to Robust Persistent Laplacian Sparse PCA (RpLSPCA) which is obtained via the following optimization formula: min For convenience, we can also refer to this method simply as Topological PCA (tPCA).This method should better retain geometrical structure information by emphasizing topological features that are persistent at multiple scales through the harmonic spectra, while the non harmonic spectra contributes other geometric shape information.

KNN Induced Persistent Laplacians
Rather than varying a distance threshold on the entries of our weighted Laplacian matrix to induce filtration, we can instead vary the number of neighbors we use to construct each graph structure.This construction has been used in the past to establish kNN-based persistent homology techniques, and extends naturally to the construction of a sequence of Persistent Laplacians [41].Observe Figure 1.We then can replace PLs in Eq. 16 by kNN-PLs  We vary k from 1 to p to establish a sequence of p Persistent Laplacians with different connectivities at each scale of the filtration.Again, diagonal entries are set equal to negative row sums, giving the total connectivity of each vertex.While the Vietoris-Rips based construction requires a choice of scale parameter for the Gaussian Kernel weighting of our graph Laplacian for the sake of normalization, the kNN construction eliminates the need for this since the filtrations do not depend on varying a distance threshold [41].Furthermore, the sparseness of scRNA-seq data can affect distance measurements, leading to difficulties in establishing meaningful connections between cells when reliant on a distance metric.By inducing filtration through varying the number of neighbors rather than a distance threshold, the result is then a more standardized sequence of connectivity.We also tend to observe the formation of more connected cycles via this method, which lends itself to richer, more interesting topological features in our data.We now validate the performance of these new methods against other enhanced PCA procedures, as well as NMF and tSNE, on several benchmark scRNA-seq datasets.

Data and Preprocessing
Table 1 shows the summary of the data, including the GEO accession ID, reference, source organism, number of samples, number of genes, and number of cell types.For each dataset, log-transform was applied, and genes with values less than 10 −6 were set to zero.Afterward, 20% − 25% of the lowest variance genes were dropped.In certain instances where a class had fewer than 15 samples, we dropped that class from the analysis.For the PCA methods, values were then demeaned and scaled by the standard deviation, while for NMF, values were standardized by sklearn's NMF function.Note the differing dimensionality and number of cell types of each dataset, underscoring the comprehensiveness of our proposed methods.ARI describes how well two clusterings agree with each other by comparing pairs of data points and their respective class assignments.It also can account for possible random agreement and adjusts the similarity score accordingly, assigning a value in the range of -1 to 1.A value of 1 indicates a perfect agreement between clusterings, a value of 0 indicates a random chance agreement, and a value of -1 suggests that the clusterings are less similar than they would be by chance.For two clusterings X = {X 1 , ..., X r } and Y = {Y 1 , ..., Y s }, we construct a contingency table A ∈ R r×s with elements a ij which describe the overlap between X i and Y j .We then take row sums and column sums to obtain another set of values: {q 1 , ..., q r } is the set of row sums and {p 1 , ..., p s } is the set of column sums.We can then define the Adjusted Rand Index as:

Normalized Mutual Information (NMI)
Mutual Information considers a split of the data according to clusters and a split according to true class labels, and measures how these splittings agree with each other.NMI then corrects for any bias and normalizes the scores between 0 and 1.A value of 1 indicates a perfect agreement between the splittings while a value of 0 indicates random chance agreement.The mathematical definition of NMI is given as: NMI(T, P ) = MI(T, P ) (E(T ) + E(P ))/2 (20) Where MI(•, •) and E(•) represent mutual information and entropy and T, P represent the true and predicted cluster labels, respectively.

Classification Metrics
Regarding the evaluation metrics used to measure performance for classification tasks, beyond simple accuracy, there are several metrics that are commonly considered.Notably, Precision, Recall, and F1-Score.
Below we list the mathematical definition of each.We note that the F1-score is particularly relevant as it accounts for both precision and recall, making it more robust to class imbalances within the data.In our case, there are noticeable imbalances between cell types in each of the tested dataset, so we emphasize this metric as the most informative in measuring performance.

Recall
Given that our datasets generally contain multiple classes, the evaluation criterion we employ is the mean for each cell type.This evaluation approach is commonly known as a macro metric, where performance measures are calculated for each cell type individually and then averaged to obtain an overall score.

Residue Similarity Scores
To enhance visualization, Residue Similarity (RS) scores can be computed [21].Traditional visualization techniques often involve reducing the data to two or three dimensions, which may result in the loss of structure and integrity in multiclass data.R-S plots were introduced as a method to visualize results while better preserving the underlying structure of the data.
An R-S plot consists of two main components: the residue score and the similarity score.The residue score is calculated as the sum of distances between classes, capturing the dissimilarity between them.On the other hand, the similarity score represents the average similarity within each class, indicating the degree of similarity between instances belonging to the same class.By considering both scores, R-S plots provide a comprehensive representation of the data's structure in a visualization.

Given data of the form
, we have y i representing the class label of our ith data point ⃗ x i ∈ X. Say that our data has N samples, M features, and L classes.We can then partition our dataset X into subsets containing each of the classes by taking C l = {⃗ x i ∈ X|y i = l}.For each class l we then define the residue score as follows: where ∥•∥ denotes the Euclidean distance between vectors and R max is the maximal residue score for that subset.The similarity score, meanwhile, is given as: where d max is the maximal pairwise distance of the dataset.For constructing R-S plots, we then take R(⃗ x) to be the x-axis and S(⃗ x) to be the y-axis.

Comparison of tPCA and Other Methods for KMeans Clustering
We tested tPCA's performance on the datasets described in Table 1, and compared it to PCA, sPCA, RgLSPCA, NMF, UMAP, and tSNE.For this analysis, we performed K-Means clustering after reducing the dimensionality of each dataset such that the number of dimensions equals the number of clusters.To ensure the greatest accuracy in our analysis, we used sklearn's KMeans function, and increased the number of times the k-means algorithm is run with different centroid seeds from 10 to 150.We then performed the clustering with 30 random instances and considered the average performance.To further facilitate a fair and accurate comparison, we utilized sklearn's NMF function, initialized as non-negative random matrices with values scaled by the square root of the mean of X and divided by the number of components.We also increased the number of maximum iterations from 200 to 300, and took the average performance over 20 random initializations.Likewise, we also compare our method to sklearn's tSNE over 20 random intializations with maximum iterations increased to 300.For UMAP, we used the default minimum distance allowed for packing points together in the embedding space, the low value generally should improve clustering performance by providing a cleaner separation between clusters.We used the default number of nearest neighbors in UMAP's kNN structure, which was 15.This value is the same as the initial number of neighbors used to describe the manifold structure in tPCA.
We do, however, note certain weaknesses with NMF, UMAP, and tSNE in this analysis.First, methods such as NMF tend to perform poorly when reducing to such low dimensionality, while methods such as tSNE and UMAP have notable issues with structure preservation.Specifically, neither method is capable of strictly preserving the density or distance in our data.Therefore, we expect that our procedure should prove substantially more effective as a preprocessing step than any of these methods for clustering via KMeans or classifying via K-Nearest-Neighbors, as both models are reliant on the intrinsic distance and density in the data.For the clustering analysis, a summary of our results for Adjusted Rand Index and Normalized Mutual Information can be found in Tables 2,3.For kNN-Induced tPCA, when results are shown as (•) * , this implies that a generic connectivity weighting was used, and there was no parameter optimization needed on that dataset to obtain nearly optimal performance.We see from the results in Table 2 that persistent Laplacian-enhanced PCA is able to outperform not only the other enhanced PCA procedures, but also NMF, UMAP, and tSNE, on all of the tested datasets for Adjusted Rand Index.We specifically note the superior performance on GSE82187.our tPCA outperformed RgLSPCA's ARI score by a considerable margin of 0.231.These results clearly indicate the superior ability to preserve local non-linear geometric structure achieved by persistent Laplacian regularization compared to graph Laplacian regularization, and the result is superior performance in clustering analysis.Furthermore, in several instances KNN-Induced Laplacian regularization was able to match the performance of the standard construction without any optimization.Overall, it was shown to at least outperform the other procedures on all but one tested dataset, and with a fraction of the effort required for parameter search.Once again, the results in Table 3 showcase the superiority of tPCA in all tested cases for NMI.We again note the remarkably superior performance of our method on the GSE82187 dataset specifically, where we outperform RgLSPCA in NMI by 0.089.Again, we also note the ability of the kNN-tPCA to provide optimal or near optimal performance compared to the distance based construction while not requiring an extensive parameter tuning procedure.Now, we can average the performance in each metric over all tested datasets to reveal the extent to which tPCA and kNN-tPCA outperform the other methods overall, across our 11 tested datasets.From the depicted image it is clearly evident that both methods for tPCA are superior to all other tested dimensionality reduction techniques, particularly other PCA enhancements.We especially emphasize the superiority of kNN-tPCA given the significantly reduced need for parameter optimization with this method.These results strongly reaffirms the importance of incorporating the topological information and multi-scale analysis that is possible with topological PCA into a dimensionality reduction technique.While previous techniques can capture some geometrical structure information through graph Laplacian regularization, we see that incorporating the additional filtrations greatly improves performance.Having confirmed the efficacy of our methods, we can move to examining the impact that kNN-induced filtration has on the scale of parameter tuning in more detail, as well as comparing different visualization techniques of the Eigen-Genes each method produces.

Comparison of kNN-tPCA and Other Methods for Classification
To further validate the efficacy of our proposed method, we can supplement these clustering results with a classification study using kNN.The classification of various cell types begins by randomly splitting our gene expression data into training and testing sets.The kNN model is trained on 60% of the data, and then tested on the remaining 40%.To mitigate the impact of data distribution, we employed a 5-fold cross-validation approach.The classification accuracy was calculated as the average performance over five repetitions.The mean accuracy of the classification was then recorded for subspace dimensions ranging from {100, 90, ..., 10, 1}.The results of this analysis can be seen in Tables 5 and 6.We see from this first round of results that kNN-tPCA provides a stellar improvement to performance metrics for classifications using kNN, carried out after dimensionality reduction.Notably, we observe a 2.95% improvement in F1-Score when compared to the standard graph regularization in tPCA and a remarkable 17.5% improvement when compared to traditional PCA.Compared to other dimensionality reduction techniques such as UMAP, tSNE, and NMF, the results are even more significant.This demonstrates the comprehensiveness of tPCA in being able to reduce data to a variety of embeddding dimensions while also preserving important structural information in the data.The results in Table 5 as well as those in Table 6 demonstrate that kNN induced Persistent Laplacian regularization for PCA is a comprehensive method capable of enhancing the performance of classification tasks for Single Cell RNA-Sequence data analysis.Combined with the results in Table 4, we can conclude that tPCA is a superior dimensionality reduction technique for a variety of Machine Learning methods.
To intuitively illustrate these results, in Figure 3 we have provided an illustration depicting the distribution of Accuracy and F1 performance between PCA, RgLSPCA, and kNN-tPCA as we vary the dimensionality of our reduced space on GSE82187.This clearly indicates the superior performance of our proposed method across a wide range of reduced dimensions, especially as the number of dimensions grows larger, where PCA typically suffers from stability issues.This clearly further validates our findings.When we average each of the performance metrics over the 11 tested datasets, we can assess the total improvement that our method provides compared to other techniques as shown in Table 7.We can then intuitively visualize these results by examining Figure 4, which clearly showcases the superiority of kNN-tPCA for classification tasks on a variety of datasets with different dimensionalities and data imbalances.We specifically note that, on average, kNN-tPCA outperforms RgLSPCA by a margin of 1.39% for Macro-ACC, 1.88% for Macro-PRE, 2.82% for Macro-REC, and 2.91% for Macro-F1.This clearly demonstrates the benefits of incorporating multi-scale analysis through the inclusion of persistent Laplacians.Furthermore, we note that our method outperforms traditional PCA by up to a considerable 18.3% for these metrics over the 11 datasets.

Parameter Analysis
Regarding the optimization of hyper-parameters for tPCA, we especially note the presence of the ζ weights in the P L term: Which must be manually chosen for each dataset depending on the connectivity information that is most important.Specifically, for p filtrations we generally consider a distribution of {1, 1/2, ..., 1/p, 0} and perform a parameter search over this distribution, while also simultaneously searching for an optimal γ value.However, for larger p this clearly becomes an extremely computationally intensive task, with the number of parameter combinations equaling ((p + 1) p )(Size of γ distribution).Therefore, rather than considering all parameter combinations over this distribution at once, we can instead consider different combinations of scales of connectivity, say, long, middle, and close range.In other words, for 7 filtrations, testing combinations of p = 7, 5, 3, and from there recognizing which scales contribute the most valuable information to narrow our search.Doing so reduces the number of combinations from ((p + 1) p )(Size of γ distribution) to (m)((p + 1) 3 )(Size of γ distribution), where m is the number of connectivity combinations we need to test to achieve the best results.In practice, we found that generally m = 3 obtained allowed us to obtain optimal performance, which is a considerable improvement from the traditional approach to grid search, though clearly still not preferable for practical purposes.
Ideally, the more standardized filtrations present with kNN Laplacians will reduce the need for parameter optimization entirely, significantly reducing computation and time requirements.As opposed to performing grid search for each dataset, we can universally choose a given set of weights that decrease as connectivity information decreases, such as {ζ t = 1/t}, t = 1, ..., p, and observe whether there is still a meaningful improvement in performance without the need for any parameter search.In case there is still need for some optimization, we can at least significantly restrict the parameter distribution, decreasing the amount of time needed for tuning.Specifically, we weight connectivities as being either unimportant (ζ = 0), or important (ζ = 1).This results in the number of tested parameter combinations equaling (2 p )(Size of γ distribution).Ultimately, in practice with 8 filtrations we found that this meant fewer than 1/5 the amount of tested parameter combinations were needed to obtain optimal or near optimal results compared to the original construction.In most cases, however, no parameter search was even necessary at all.In Figure 5, we include a chart illustrating the scale of the respective parameter searches as we vary the number of filtrations.We see from this that, for a reasonable number of filtrations, the parameter search necessary for the kNN construction is a fraction of that needed for the standard construction, while the results listed in Table 4 showcase that the performance is still optimal or at least near optimal compared to other dimensionality reduction techniques.
The β and γ parameters are similarly found via grid search.In Figure 6, we depict how different parameter combinations impact the accuracy of our K-Means clustering.Ultimately, parameter values ranging from 10 −10 to 10 10 were found to produce stable results.

Visualization of tPCA Eigen-Gene
In the context of scRNA-seq clustering, an eigen-gene refers to the principal components produced by our dimensionality reduction.Eigen-genes summarize the gene expression patterns within a cluster given that they are linear sums of the features which explain the most variation in that cluster.By reducing our data to two dimensions via UMAP or tSNE, we can visualize our eigen-genes in a 2D plot.This visualization can identify the clusters with similar or distinct gene expression profiles.
For a dataset containing, say, 20,000 genes, an aggressive reduction to k = 2 dimensions typically results in poorly maintaining the integrity of the data, leading to ineffective visualizations of the eigen-genes.Thus, an important step in the data visualization process is pre-processing.If we first reduce our data to, say, k = 50 dimensions via PCA before then reducing again to k = 2 dimensions via tSNE or UMAP, we should see an improvement in the representation of our clusters.However, given the associated weaknesses with traditional PCA that we have discussed previously, there may be additional benefit to be gained from preprocessing the data with Topological PCA instead.In Figures 7, 8, 9 we compare the 2D visualizations for several of the tested datasets when pre-processing via PCA and tPCA to assess this improvement in terms of visualization and potential biological insights.In Figure 7, we compare visualization techniques for GSE84133mouse1.In Veres et al, β cells were found to have heterogeneity between two distinct subpopulations [46].However, traditional PCA-enhanced tSNE separates the subpopulations into two clusters that are far away and considerably mixed with δ and other cell types.Our method manages to visualize the cells more similarly, while still displaying the genetic heterogeneity in the population.Furthermore, there is considerably improved separation between the β cells and other cell types, particularly δ cells.For UMAP, we observe that PCA-enhanced UMAP clusters all cell types into two relatively homogeneous clusters.Pre-processing with kNN-tPCA, meanwhile, manages to separate all cell types fairly well, with the exception of Endothelial and Quiescent Stellate cells.This is likely explained by quiescent stellate cells being located primarily around vascular cells in the pancreas, including Endothelial cells, leading to similar gene expression profiles between the two cell types.Gaining a further understanding of this spatial organization is crucial for understanding the mechanisms underlying pancreatic diseases.In Figure 8, we compare visualization techniques for GSE82187.For tSNE as well as UMAP, we note improved separation between Astro, Ependy-C, and OPC cells when pre-processing with tPCA rather than traditional PCA.Furthermore, like with traditional PCA-enhanced UMAP and tSNE, kNN-tPCA preprocessing still enables us to identify the distinct D1 and D2 medium spiny neuron subtypes even when inducing sparsenss in our principal components.In both of our improved visualizations, there seems to be more of a continuous gradient between the subtypes rather than a discrete separation.Continuous gradients indicate that neurons within each subtype lie on a spectrum of gene expression values, with many cells having a range of intermediate expression values.These results are supported by the findings in Gokce et al [44].In Figure 9, we compare visualization techniques for GSE75748cell.We note for both tSNE and UMAP, the PCA pre-processed version clusters H1 and H9 cells into one homogeneous cluster given the similar gene expression profile of these cells [43].However, the kNN-tPCA enhanced versions were still able to differentiate these cell types.The same can be said of DEC and EC cells.DEC cells were also found to have lower similarity in their clustering with kNN-tPCA enhanced tSNE, indicating a heterogeneous pool of DEC cells.These results are supported by the findings in Chu et al [43].In both instances when pre-processing with tPCA, the H9 cells formed two distinct clusters, indicating some kind of possible heterogeneity in the genetic profiles of these cells.

RS Plot Analysis
To more effectively visualize our gene expression data after dimensionality reduction, we can generate Residue-Similarity plots for some of the tested datasets [21].We can then compare results for classifying cell types after reducing the data via RgLSPCA and kNN-tPCA.In Figure 10 we produce RS plots for each method on GSE82187 to compare classification accuracy.We observe a significant improvement, particularly in identifying the cell types in panels two and eight, or Ependy-C and Vascular cells respectively.Note specifically that for Ependy-C cells the samples are situated in the top-right corner, indicating a significantly improved cluster boundary separation and inter-cluster similarity in that clustering when utilizing persistent Laplacian regularization.

RgLSPCA kNN-tPCA GSE82187
Figure 10: RS plots of clusters generated from RgLSPCA and kNN-tPCA based dimensionality reduction.The x-axis is the residual score, and the y-axis is the similarity score.Each section corresponds to one cluster and the data were colored according to the predicted labels from kNN on the GSE82187 dataset at k = 100.
Similarly, for GSE67835 we note a considerable improvement in our ability to correctly identify replicating fetal neurons and Microglia in panels five and six respectively.For Microglia cells in particular, we again observe a significant improvement in the residual score for that clustering, indicating that tPCA yields a greater dissimilarity between these cells and other cell types than RgLSPCA.Specifically, tPCA improves the separation between Microglia and quiescent fetal neurons/OPC cells.In panel one, we note that classification after dimensionality reduction via kNN-tPCA has a slightly greater tendency to misidentify OPC cells with lower similarity scores as Microglia cells, indicating that these cells exhibited a similar gene expression profile, which is supported by the findings in Darmanis et.al. for a subset of the OPC population [42].

kNN-tPCA RgLSPCA GSE67835
Figure 11: RS plots of clusters generated from RgLSPCA and kNN-tPCA based dimensionality reduction.The x-axis is the residual score, and the y-axis is the similarity score.Each section corresponds to one cluster and the data were colored according to the predicted labels from KNN on the GSE67835 dataset at k = 100.

Conclusion
Single Cell RNA sequencing technologies have grown considerably in popularity in recent years, and with the ability to reveal vast amounts of information regarding the drivers behind various diseases as well as potential bio-therapeutic targets, effective analysis of this data is of paramount importance in the field of biomedical research.As we have seen, intrinsic high dimensionality of the data introduces computational complexity as well as considerable noise, hindering any meaningful analysis.Thus, dimensionality reduction is a crucial step of the process, and we seek, as always, to maximize the accurate representation of our data in the new, reduced space.To this end, we propose topological PCA for scRNA-seq clustering.This method combines a new robustness via L 2,1 norm regularization, sparsity constraints, and improved geometrical structure capture via persistent Laplacian regularization.
Extensive benchmark testing on 11 scRNA-seq datasets showcases that our proposed method significantly outperforms other similar PCA enhancements, as well as non-negative matrix factorization, for KMeans clustering after dimensionality reduction.While previous methods such as graph Laplacian Sparse PCA account for sparsity and local geometry preservation, the method is limited by analysis of a simplicial complex at only a single scale.Furthermore, Frobenius norm regularization is sensitive to outliers.The incorporation of a persistent Laplacian term contributes to multi-scale analysis through a sequence of filtrations, as well as persistent homology information derived from the harmonic spectra of our Laplacian matrices.Compared to NMF, we observe an average improvement of 13.53% for NMI and 31.92% for ARI.
While our method achieves superb results for clustering analysis after dimensionality reduction, there is still considerable room for improvement.First, our method considers only L 0 Laplacian, and therefore lacks higher order connectivity information.Furthermore, there remains the work of continuing our parameter analysis, to arrive at a means of optimizing our {ζ} weights which is more efficient and produces more optimal results than simple grid search.While kNN-induced persistent Laplacians seem to be less dependent on parameter tuning, there is still added benefit to examining means of optimizing the performance, and so we should hope to achieve this in a more efficient manner.

Data and Model Availability
The data and model used to produce these results can be obtained at the Single Cell Data Processing and RpLSPCA scRNA-seq GitHub Repositories: Topological PCA GitHub repository: https://github.com/seanfcottrell/Topological-PCASingle Cell Data Processing GitHub repository: https://github.com/hozumiyu/SingleCellDataProcess

Figure 1 :
Figure 1: Comparison of inducing filtration via Vietoris Rips Complex, which is reliant on a chosen distance threshold ϵ, and k-nearest-neighbors induced filtration, which is induced by varying the number of nearest neighbors at each node in our graph.

Figure 2 :
Figure 2: NMF and ARI comparisons for each method averaged over all datasets.

Figure 3 :
Figure 3: Distributions of ACC and F1 performance for PCA, RgLSPCA, and kNN-tPCA as we vary the number of reduced subspace dimensions.

Figure 4 :
Figure 4: ACC, PRE, REC, and F1 comparisons for each methods averaged over all datasets.

Figure 5 :
Figure5: Comparison of the scales of parameter searches necessary between the limited approach to distance based filtrations and the limited approach to kNN-Induced filtrations.

Figure 6 :
Figure 6: Variations in KMeans accuracy for different combinations of γ and β parameter values

Figure 7 :
Figure 7: Comparison of visualization techniques between PCA-enhanced tSNE and UMAP, and kNN-tPCA-Enhanced tSNE and UMAP for GSE84133mouse1.Data was log-transformed, with low variance genes removed.For kNN-tPCA-Enhanced tSNE and UMAP, ζ weights were chosen universally as {ζt} = {1/t} for the tth filtration, and data was reduced to k = 50 dimensions.Cells are color coded according to true cell types provided by original authors.Labels 0 through 12 correspond to B cells, T cells, Activated Stellate, α cells, β cells, δ cells, Ductal cells, Endothelial cells, γ cells, Immune cells (other), Macrophage cells, Quiescent Stellate, and Schwann cells respectively.

Figure 8 :
Figure 8: Comparison of visualization techniques between PCA-enhanced tSNE and UMAP, and kNN-tPCA-Enhanced tSNE and UMAP for GSE82187.Data was log-transformed, with low variance genes removed.For kNN-tPCA-Enhanced tSNE and UMAP, ζ weights were chosen universally as {ζt} = {1/t} for the tth filtration, and data was reduced to k = 50 dimensions.Cells are color coded according to true cell types provided by original authors.Labels 0 through 9 correspond to Astro cells, Ependy-C cells, Ependy-Sec cells, Macrophage cells, Microglia cells, NSC cells, Neuron cells, OPC cells, Oligo cells, and Vascular cells respectively.

Figure 9 :
Figure 9: Comparison of visualization techniques between PCA-enhanced tSNE and UMAP, and kNN-tPCA-Enhanced tSNE and UMAP for GSE75748cell.Data was log-transformed, with low variance genes removed.For kNN-tPCA-Enhanced tSNE and UMAP, ζ weights were chosen universally as {ζt} = {1/t} for the tth filtration, and data was reduced to k = 50 dimensions.Cells are color coded according to true cell types provided by original authors.Labels 0 through 6 correspond to DEC cells, EC cells, H1 cells, H9 cells, HFF cells, NPC cells, and TB cells respectively.

Table 1 :
Accession ID, source organism, and the counts for samples, genes, cell types and normalization for 11 datasets 3.2.1 Adjusted Rand Index (ARI)

Table 2 :
Comparison of tPCA and other methods for Adjusted Rand Index

Table 3 :
Comparison of tPCA and other methods for Normalized Mutual Information

Table 4 :
Comparison of tPCA and other methods for each performance metric averaged over all tested datasetsWe see from the final results in Table4that, on average, tPCA outperforms NMF by a significant measure of 31.92% for ARI and 13.53% for NMI, and RgLSPCA by 3.78% for NMI and 5.92% for ARI.kNN-tPCA, meanwhile, outperforms NMF by 31.05% for ARI and 12.58% for NMI, and RgLSPCA by 2.91% for NMI and 5.22% for ARI.To intuitively illustrate this point, in Figure2we provide a barplot comparing the performance metrics of the mentioned procedures averaged over each of the 11 tested datasets.

Table 5 :
Comparison of average results for kNN-tPCA and Other Methods for Classification after dimensionality reduction to m = 1, 10, ..., 100

Table 6 :
Comparison of average results for kNN-tPCA and Other Methods for kNN Classification after dimensionality reduction to m = 1, 10, ..., 100

Table 7 :
Comparison of kNN-tPCA and other methods for each performance metric averaged over all tested datasets